{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "# businesses\n",
    "bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')\n",
    "\n",
    "# inspections\n",
    "insp = pd.read_csv(\"data/inspections.csv\")\n",
    "\n",
    "# violations\n",
    "viol = pd.read_csv(\"data/violations.csv\")\n",
    "\n",
    "# DAWN\n",
    "colspecs = [(0,6), (14,29), (33,35), (35, 37), (37, 39), (1213, 1214)]\n",
    "varNames = [\"id\", \"wt\", \"age\", \"sex\", \"race\",\"type\"]\n",
    "dawn = pd.read_fwf('data/DAWN-Data.txt', colspecs=colspecs, header=None, index_col=0,\n",
    "                   names = varNames)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(ch:wrangling_checks)=\n",
    "# Quality Checks\n",
    "\n",
    "Once your data are in a table and you understand the scope and granularity, it's time to inspect for quality. You may have come across errors in the source as you examined and wrangled the file into a dataframe. In this section, we describe how to continue this inspection and carry out a more comprehensive assessment of the quality of the features and their values.  We consider data quality from four vantage points: \n",
    "\n",
    "Scope\n",
    ": Do the data match your understanding of the population? \n",
    "\n",
    "Measurements and values\n",
    ": Are the values reasonable?\n",
    "\n",
    "Relationships\n",
    ": Are related features in agreement?\n",
    "\n",
    "Analysis\n",
    ": Which features might be useful in a future analysis? \n",
    "\n",
    "We describe each of these points in turn, beginning with scope."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quality Based on Scope \n",
    "\n",
    "In {numref}`Chapter %s <ch:data_scope>`, we addressed whether or not the data that have been collected can adequately address the problem at hand. There, we identified the target population, access frame, and sample in collecting the data.\n",
    "That framework helps us consider possible limitations that might impact the generalizability of our findings.\n",
    "\n",
    "While these broader data-scope considerations are important\n",
    "as we deliberate our final conclusions, they are also useful for\n",
    "checking data quality.\n",
    "For example, for the San Francisco restaurant inspections data introduced in {numref}`Chapter %s <ch:files>`, a side investigation tells us that zip codes in the city should start with 941. \n",
    "But a quick check shows that several zip codes begin with other digits:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "92672        1\n",
       "64110        1\n",
       "94120        1\n",
       "            ..\n",
       "94621        1\n",
       "941033148    1\n",
       "941          1\n",
       "Name: postal_code, Length: 10, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bus['postal_code'].value_counts().tail(10)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This verification using scope helps us spot potential problems."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As another example, a bit of background reading at [Climate.gov](https://www.climate.gov/) and [NOAA](https://www.noaa.gov/news-release/carbon-dioxide-now-more-than-50-higher-than-pre-industrial-levels) on the topic of atmospheric CO2 reveals that typical measurements are about 400 ppm worldwide. So we can check whether the monthly averages of CO2 at Mauna Loa lie between 300 and 450 ppm.\n",
    "\n",
    "Next, we check data values against codebooks and the like. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quality of Measurements and Recorded Values\n",
    "\n",
    "We can use also check the quality of measurements by considering what might be a reasonable value for a feature. For example, imagine what might be a reasonable range for the number of violations in a restaurant inspection: possibly, 0 to 5. Other checks can be based on common knowledge of ranges: a restaurant inspection score must be between 0 and 100; months run between 1 and 12.  We can use documentation to tell us the expected values for a feature. For example, the type of emergency room visit in the DAWN survey, introduced in {numref}`Chapter %s <ch:files>`, has been coded as 1, 2, ..., 8 (see {numref}`Figure %s <DAWN_codebook>`). So we can confirm that all values for the type of visit are indeed integers between 1 and 8."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{figure} figures/DAWN_codebook.png\n",
    "---\n",
    "name: DAWN_codebook\n",
    "---\n",
    "\n",
    "Screenshot of the description of the emergency room visit type (CASETYPE) variable in the DAWN survey (the typo SUICICDE appears in the actual codebook)\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also want to ensure that the data type matches our expectations. For example, we expect a price to be a number, whether or not it's stored as integer, floating point, or string.  Confirming that the units of measurement match what is expected can be another useful quality check to perform (for example, weight values recorded in pounds, not kilograms). We can devise checks for all of these situations.  \n",
    "\n",
    "Other checks can be devised by comparing two related features."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quality Across Related Features\n",
    "\n",
    "At times, two features have built-in conditions on their values that we can cross-check for internal consistency. For example, according to the documentation for the DAWN study, alcohol consumption is only considered a valid reason for a visit to the ER for patients under age 21, so we can check that any record with \"alcohol\" for the type of visit has an age under 21. A cross-tabulation of the features `type` and `age` can confirm that this constraint is met:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>type</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>age</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>-8</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>21</td>\n",
       "      <td>5</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>6</td>\n",
       "      <td>20</td>\n",
       "      <td>6231</td>\n",
       "      <td>313</td>\n",
       "      <td>4</td>\n",
       "      <td>2101</td>\n",
       "      <td>69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>2</td>\n",
       "      <td>15</td>\n",
       "      <td>1774</td>\n",
       "      <td>119</td>\n",
       "      <td>4</td>\n",
       "      <td>119</td>\n",
       "      <td>61</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>914</td>\n",
       "      <td>121</td>\n",
       "      <td>2433</td>\n",
       "      <td>2595</td>\n",
       "      <td>1183</td>\n",
       "      <td>48</td>\n",
       "      <td>76</td>\n",
       "      <td>4563</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>817</td>\n",
       "      <td>796</td>\n",
       "      <td>4953</td>\n",
       "      <td>3111</td>\n",
       "      <td>1021</td>\n",
       "      <td>95</td>\n",
       "      <td>44</td>\n",
       "      <td>6188</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>983</td>\n",
       "      <td>1650</td>\n",
       "      <td>0</td>\n",
       "      <td>4404</td>\n",
       "      <td>1399</td>\n",
       "      <td>170</td>\n",
       "      <td>48</td>\n",
       "      <td>9614</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1068</td>\n",
       "      <td>1965</td>\n",
       "      <td>0</td>\n",
       "      <td>5697</td>\n",
       "      <td>1697</td>\n",
       "      <td>140</td>\n",
       "      <td>62</td>\n",
       "      <td>11408</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>957</td>\n",
       "      <td>1748</td>\n",
       "      <td>0</td>\n",
       "      <td>5262</td>\n",
       "      <td>1527</td>\n",
       "      <td>100</td>\n",
       "      <td>60</td>\n",
       "      <td>10296</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1847</td>\n",
       "      <td>3411</td>\n",
       "      <td>0</td>\n",
       "      <td>10221</td>\n",
       "      <td>2845</td>\n",
       "      <td>113</td>\n",
       "      <td>115</td>\n",
       "      <td>18366</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>1616</td>\n",
       "      <td>3770</td>\n",
       "      <td>0</td>\n",
       "      <td>12404</td>\n",
       "      <td>3407</td>\n",
       "      <td>75</td>\n",
       "      <td>150</td>\n",
       "      <td>18381</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>616</td>\n",
       "      <td>1207</td>\n",
       "      <td>0</td>\n",
       "      <td>12291</td>\n",
       "      <td>2412</td>\n",
       "      <td>31</td>\n",
       "      <td>169</td>\n",
       "      <td>7109</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>205</td>\n",
       "      <td>163</td>\n",
       "      <td>0</td>\n",
       "      <td>24085</td>\n",
       "      <td>2218</td>\n",
       "      <td>12</td>\n",
       "      <td>308</td>\n",
       "      <td>1537</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "type     1     2     3      4     5    6     7      8\n",
       "age                                                  \n",
       "-8       2     2     0     21     5    1     1     36\n",
       " 1       0     6    20   6231   313    4  2101     69\n",
       " 2       8     2    15   1774   119    4   119     61\n",
       " 3     914   121  2433   2595  1183   48    76   4563\n",
       " 4     817   796  4953   3111  1021   95    44   6188\n",
       " 5     983  1650     0   4404  1399  170    48   9614\n",
       " 6    1068  1965     0   5697  1697  140    62  11408\n",
       " 7     957  1748     0   5262  1527  100    60  10296\n",
       " 8    1847  3411     0  10221  2845  113   115  18366\n",
       " 9    1616  3770     0  12404  3407   75   150  18381\n",
       " 10    616  1207     0  12291  2412   31   169   7109\n",
       " 11    205   163     0  24085  2218   12   308   1537"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "display_df(pd.crosstab(dawn['age'], dawn['type']), rows=12)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cross-tabulation confirms that all of the alcohol cases (`type` is 3) have an age under 21 (these are coded as 1, 2, 3, and 4). The data values are as expected.  \n",
    "\n",
    "One last type of quality check pertains to the amount of information found in a feature. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quality for Analysis\n",
    "\n",
    "Even when data pass the previous quality checks, problems can arise with its usefulness. For example, if all but a handful of values for a feature are identical, then that feature adds little to the understanding of underlying patterns and relationships.  Or if there are too many missing values, especially if there is a discernible pattern in the missing values, our findings may be limited. Plus, if a feature has many bad/corrupted values, then we might question the accuracy of even those values that fall in the appropriate range."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see in the following code that the type of restaurant inspection in San Francisco can be either routine or from a complaint. Since only one of the 14,000+ inspections was from a complaint, we lose little if we drop this feature, and we might also want to drop that single inspection since it represents an anomaly:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "routine      14221\n",
       "complaint        1\n",
       "Name: type, dtype: int64"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.value_counts(insp['type'])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once we find problems with our data, we need to figure out what to do."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fixing the Data or Not\n",
    "\n",
    "When you uncover problems with the data, essentially you have four options: leave the data as is; modify values; remove features; or drop records.  "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Leave it as is\n",
    ": Not every unusual aspect of the data needs to be fixed. You might have discovered a characteristic of your data that will inform you about how to do your analysis and otherwise does not need correcting. Or you might find that the problem is relatively minor and most likely will not impact your analysis, so you can leave the data as is. Or, you might want to replace corrupted values with `NaN`.\n",
    "\n",
    "Modify individual values\n",
    ": If you have figured out what went wrong and can correct the value, then you can opt to change it. In this case, it's a good practice to create a new feature with the modified value and preserve the original feature, like in the CO2 example."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remove a column\n",
    ": If many values in a feature have problems, then consider eliminating that feature entirely. Rather than exclude a feature, there may be a transformation that allows you to keep the feature while reducing the level of detail recorded."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Drop records\n",
    ": In general, we do not want to drop a large number of observations from a dataset without good reason. Instead, try to scale back your investigation to a particular subgroup of the data that is clearly defined by some criteria, and does not simply correspond dropping records with corrupted values. When you discover that an unusual value is in fact correct, you still might decide to exclude the record from your analysis because it's so different from the rest of your data and you do not want it to overly influence your analysis. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whatever approach you take, you will want to study the possible impact of the changes that you make on your analysis.  For example, try to determine whether the records with corrupted values are similar to one another, and different from the rest of the data."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Quality checks can reveal issues in the data that need to be addressed before proceeding with analysis. \n",
    "One particularly important type of check is to look for missing values. We suggested\n",
    "that there may be times when you want to replace corrupted data values with\n",
    "`NaN`, and hence treat them as missing. At other times, data might arrive\n",
    "missing. What to do with missing data is an important topic, and there is a lot\n",
    "of research on this problem; we cover ways to address missing data in the next section."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
